Load the data set you exported in the final Task of Case Study 2. Eliminate all observations with missing values in the income status variable.
df <- read.csv('./final_dataset.csv')
nrow(df) # Result is 175
## [1] 175
# Eliminate missing values in the income status variable
df <- df[!is.na(df$income_group), ]
nrow(df) # Result is the same 175, since we removed all rows consisting of at least one NA already in task 2
## [1] 175
str(df)
## 'data.frame': 175 obs. of 7 variables:
## $ country : chr "Monaco" "Japan" "Germany" "Italy" ...
## $ median_age : num 55.4 48.6 47.8 46.5 45.6 45.3 45.2 44.9 44.6 44.6 ...
## $ net_migration_rate : num 8.3 0 1.5 3.2 1.7 0.9 6.6 1.5 5.2 0.3 ...
## $ youth_unemployed_rate: num 26.6 3.6 6.2 32.2 8.7 39.9 27.4 8.8 10.1 20.3 ...
## $ income_group : chr "high income" "high income" "high income" "high income" ...
## $ region : chr "western europe" "eastern asia" "western europe" "southern europe" ...
## $ continent : chr "europe" "asia" "europe" "europe" ...
Using ggplot2, create a density plot of the median age grouped by
income status groups. The densities for the different groups are
superimposed in the same plot rather than in different plots. Ensure
that you order the levels of the income status such that in the plots
the legend is ordered from High (H) to Low (L).
• The color of the density lines is black.
• The area under the density curve should be colored differently among
the income status levels.
• For the colors, choose a transparency level of 0.5 for better
visibility.
• Position the legend at the top center of the plot and give it no title
(hint: use element_blank()).
• Rename the x axis as “Median age of population”
Comment briefly on the plot.
As shown in the plot below, there is a significant peak in the density plot of the low income countries with a density higher than 0.15, referring to a very young median age of the population of approximately 18-19 years. Further it can be seen that the lower middle income countries have their highest density of 25 years, upper middle income countries of 30 years and high income countries have their climax at around 43 years. Especially in the high income density plot, we can see a right-skewed curve, which means the mean is greater than the median. This leads to the interpretation, that as higher the income of a country as higher is the median age of its citizens.
df$income_group <- factor(df$income_group, levels = c("high income", "upper middle income", "lower middle income", "low income"))
plot <- ggplot(df, aes(x = median_age, fill = income_group) ) +
geom_density(alpha = 0.5, color = "black") +
labs(x = "Median age of population") +
theme_minimal()+
theme(legend.position = "top",
legend.title = element_blank())
plot
Investigate how the income status is distributed in the different
continents.
• Using ggplot2, create a stacked barplot of absolute frequencies
showing how the entities are split into continents and income status.
Comment the plot.
• Create another stacked barplot of relative frequencies (height of the
bars should be one). Comment the plot.
• Create a mosaic plot of continents and income status using base R
functions.
• Briefly comment on the differences between the three plots generated
to investigate the income distribution among the different
continents.
As shown in this plot, Africa, Asia and Europe have the highest absolute frequencies of countries of approximately 40 countries each. Africa consists mostly of low income and lower middle income countries. Asia almost equally consists of lower middle, upper middle und high income countries. Whereas Europe mainly consists of high income countries and around one fifth of upper middle income countries. Neither in Europe, North America, Oceania nor in South America is a low income country placed.
# Aggregate data to get frequencies
df_agg <- df %>%
count(continent, income_group)
ggplot(df_agg, aes(x = continent, y = n, fill = income_group)) +
geom_bar(stat = "identity") +
labs(title = "Absolute Frequencies of Income Status by Continent",
x = "Continent", y = "Absolute Frequency") +
theme_minimal()+
theme(legend.position = "top",
legend.title = element_blank())
• Create another stacked barplot of relative frequencies (height of the bars should be one). Comment the plot.
With the relative frequency is given, it is easy to compare each income status per continent with each other. This can be done by providing the percentage of each country of a specific income group. For example South America has the biggest proportion of upper middle income countries than any other region, which is difficult to see in the absolute frequency plot. Also Asia, Oceania and South America have a similar proportion of high income countries.
# Calculate relative frequencies
df_relative <- df_agg %>%
group_by(continent) %>%
mutate(relative_frequency = n / sum(n))
ggplot(df_relative, aes(x = continent, y = relative_frequency, fill = income_group)) +
geom_bar(stat = "identity") +
labs(title = "Relative Frequencies of Income Status by Continent",
x = "Continent", y = "Relative Frequency") +
theme_minimal()+
theme(legend.position = "top",
legend.title = element_blank())
• Create a mosaic plot of continents and income status using base R functions.
# Create a contingency table
contingency_table <- table(df$continent, df$income_group)
# Mosaic plot
mosaicplot(contingency_table, main = "Mosaic Plot of Continents and Income Status", shade = TRUE, color = TRUE, las=2)
• Briefly comment on the differences between the three plots generated
to investigate the income distribution among the different
continents.
With a stacked barplot of absolute frequencies the total amount can be investigated quickly between each continent. This makes it easy to analyze exact numbers of countries per continent but harder to compare continents by their income group with a different amount of countries. The stacked barplot of relative frequencies conveys a better understanding of occurrences of each income group per continent. This makes it far easier to compare continents with a significant difference of total countries by their income groups. The mosaic plot is not as intuitively as the other two plots. First, it can be easily observed that the distributions are the same of the relative frequency stacked barplot. The add-on is now the colouring of each rectangle with a meaning namely representing the standardized residual. As shown on the legend, the standardized residual with a positive value means that the observed count is higher than the expected count i.e. blue rectangles or white rectangles with a solid line this. The same meaning goes the other way around for negative residuals. This gives us the insight that especially Europe with their amount of high income countries and Africa with their amount of low income countries significantly differfrom what would be expected under the assumption of independence.
For Asia, investigate further how the income status distribution is in the different subcontinents. Use one of the plots in b. for this purpose. Comment on the results.
# Filter the dataset for Asian countries
asia_subset <- subset(df, continent == "asia")
# Calculate relative frequencies
asia_data_rel <- asia_subset %>%
group_by(region, income_group) %>%
summarise(count = n()) %>%
mutate(freq = count / sum(count))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
# Create a bar plot of income group distribution by region
ggplot(asia_data_rel, aes(x = region, y = freq, fill = income_group)) +
geom_bar(stat = "identity") +
labs(title = "Relative Frequency of Income Group by Region in Asia",
x = "Region",
y = "Relative Frequency",
fill = "Income Group") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Central asia, South-eastern asia and southern asia have mostly lower middle income countries. Even though for Central asia this is not totally correct since it has just as many upper middle income countries as well. Eastern asia and western asia on the opposite have mainly high income countries. Only western asia consists of all income groups whereas the other subcontinents have at least one (south-eastern asia and souther asia) or two income groups (central asia and eastern asia) missing.
• Using ggplot2, create parallel boxplots showing the distribution of
the net migration rate in the different continents.
• Prettify the plot (change y-, x-axis labels, etc).
• Identify which country in Asia constitutes the largest negative
outlier and which country in Asia constitutes the largest positive
outlier.
• Comment on the plot.
ggplot(df, aes(x = continent, y = net_migration_rate)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +
labs(title = "Distribution of Net Migration Rate by Continent",
x = "Continent",
y = "Net Migration Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In the plot we can observe Oceania has the highest distribution of net migration rate with the biggest interquartile range and the longest whiskers. 50% of its data is approximately between 0 and -7.5 and its median of -3.5. Europe and Asia also have a higher interquartile range in terms of the net migration rate. Especially Asia stands out with a lot of outliers. On the other hand Africa, North America and South America all have relatively small boxplots around zero meaning that their interquartile range is not as high besides several outliers. Especially for South America it seems that almost every country has a net migration of 0 with only one outlier, which is still quite close to the interquartile range at approximately -2.
# Identify the largest negative and positive outliers in Asia
asia_data <- df %>% filter(continent == "asia")
largest_negative_outlier <- asia_data %>% filter(net_migration_rate == min(net_migration_rate))
largest_positive_outlier <- asia_data %>% filter(net_migration_rate == max(net_migration_rate))
largest_negative_outlier
## country median_age net_migration_rate youth_unemployed_rate
## 1 Maldives 29.5 -12.7 15.9
## income_group region continent
## 1 upper middle income southern asia asia
largest_positive_outlier
## country median_age net_migration_rate youth_unemployed_rate income_group
## 1 Syria 23.5 27.1 35.8 low income
## region continent
## 1 western asia asia
Here, we can observe that the island state Maldives have the lowest net_migration_rate of -12,7 and is therefore the largest negative outlier. Syria on the other side has the highest net_migration_rate of 27.1 and is therefore the largest positive outlier. For both countries it seems reasonable that they stay as outliers since the Maldives have a strong economy relying on tourism, which attracts many people to work there and leads to a negative migration rate. For Syria it is known that the political situation is very unstable and there has even been a war, which leads to high migration rate of people fleeing out of the country.
The graph in d. clearly does not convey the whole picture. It would
be interesting also to look at the subcontinents, as it is likely that a
lot of migration flows happen within the continent.
• Investigate the net migration in different subcontinents using again
parallel boxplots. Group the boxplots by continent (hint: use facet_grid
with scales = “free_x”).
• Remember to prettify the plot (rotate axis labels if needed).
• Describe what you see.
ggplot(asia_subset, aes(x = region, y = net_migration_rate)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +
facet_grid(. ~ continent, scales = "free_x") +
labs(title = "Distribution of Net Migration Rate by asian subcontinents",
x = "Subcontinents",
y = "Net Migration Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
europe_subset <- subset(df, continent== 'europe')
ggplot(europe_subset, aes(x = region, y = net_migration_rate)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +
facet_grid(. ~ continent, scales = "free_x") +
labs(title = "Distribution of Net Migration Rate by european subcontinents",
x = "Subcontinents",
y = "Net Migration Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
africa_subset <- subset(df, continent== 'africa')
ggplot(africa_subset, aes(x = region, y = net_migration_rate)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 2) +
facet_grid(. ~ continent, scales = "free_x") +
labs(title = "Distribution of Net Migration Rate by african subcontinents",
x = "Subcontinents",
y = "Net Migration Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
For the Asian subcontinent plot eastern asia, south-eastern asia and souther asia have a very short interquartile range (IQR) around zero. Eastern asia stands out as the only subcontinent with a clearly positive median. Central asia is the only subcontinent where most of the countries have a negative migration rate. Western asia has the highest IQR with the longest whiskers from appr. 10 to -10 net migration rate and also the highest positive outlier of appr. 15, which makes it the most interesting to investigate this further. An assumption can be made that this subcontinent consists of the most countries.
For Europe, there are four subcontinents all having a median and almost an IQR between zero and five meaning there are more people migrating than leaving the subcontinents. Only northern europe stands out with a negative net migration rate until -5, seen by their whiskers. Eastern europe has one outlier, which must be the country with the lowest net migration rate. Western europe has the other outlier of this plot, which has to be the country with the highest net migration rate.
For Africa, there are five subcontients. Middle africa has two outliers. One with the highest and one with the lowest net migration rate overall in Africa. Middle and northern africa have a very small IQR and almost no whiskers, staying between 0 and -2. Western and eastern africa have relatively bigger IQR than the one before filling up the whole range of 0 to -2. Again, there is one subcontinent namely southern africa with a high IQR comparatively with a range from 0 to -6. Its median though is close to 0, which means that the distribution of this subcontinent is left-skewed and there are many countries with a net migration rate of around zero.
The plot in task e. shows the distribution of the net migration rate
for each subcontinent. Here you will work on visualizing only one
summary statistic, namely the median.
For each subcontinent, calculate the median net migration rate. Then
create a plot which contains the sub-regions on the y-axis and the
median net migration rate on the x-axis.
• As geoms use points.
• Color the points by continent – use a colorblind friendly palette (see
e.g., here).
• Rename the axes.
• Using fct_reorder from the forcats package, arrange the levels of
subcontinent such that in the plot the lowest (bottom) subcontinent
contains the lowest median net migration rate and the upper most region
contains the highest median net migration rate.
• Comment on the plot. E.g., what are the regions with the most influx?
What are the regions with the most outflux?
As shown in the plot below, the subcontinent with the highest median net migration rate i.e. the most influx is Australia and New Zealand with a median of around 7.5. The subcontinent with the lowest median net migration rate i.e. the most outflux is Micronesia with a median of around -7.5. The continent Oceania has both the subcontinents with the most influx and outflux, which seems quite interesting and might be a correlation of people of one subcontinent (Micronesia) migrating to the other subcontinent (Australia and New Zealand).
library(forcats)
# Rename variable region to subcontinent
df$subcontinent <- df$region
df$region<-NULL
# Add the colourblind palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
median_migration <- df %>%
group_by(subcontinent, continent) %>%
summarize(median_net_migration_rate = median(net_migration_rate, na.rm = TRUE))
## `summarise()` has grouped output by 'subcontinent'. You can override using the
## `.groups` argument.
ggplot(median_migration, aes(x = median_net_migration_rate, y = fct_reorder(subcontinent, median_net_migration_rate), color = continent)) +
geom_point() +
labs(x = "Median Net Migration Rate", y = "Subcontinents") +
theme_minimal() +
scale_color_manual(values = cbbPalette)
For each subcontinent, calculate the median youth unemployment rate.
Then create a plot which contains the sub-regions on the y-axis and the
median unemployment rate on the x-axis.
• Use a black and white theme (?theme_bw())
• As geoms use bars. (hint: pay attention to the statistical
transformation taking place in geom_bar() – look into argument
stat=“identity”)
• Color the bars by continent – use a colorblind friendly palette.
• Make the bars transparent (use alpha = 0.7).
• Rename the axes.
• Using fct_reorder from the forcats package, arrange the levels of
subcontinent such that in the plot the lowest (bottom) subcontinent
contains the lowest median youth unemployment rate and the upper most
region contains the highest median youth unemployment rate.
• Comment on the plot. E.g., what are the regions with the highest vs
lowest youth unemployment rate?
As shown in the plot below, the subcontinent with the highest median youth unemployment rate is Southern Africa with a median of almost 40. The subcontinent with the lowest median youth unemployment rate is Eastern Asia with a median of around 8. Here, the colorblind friendly palette is used to color the bars by continent, but to me makes it more difficult to distinguish the continents from each other than it was before.
# Calculate median youth unemployment rate for each subcontinent
median_unemployment <- df %>%
group_by(subcontinent, continent) %>%
summarize(youth_unemployed_rate = median(youth_unemployed_rate, na.rm = TRUE))
ggplot(median_unemployment, aes(x = youth_unemployed_rate, y = fct_reorder(subcontinent, youth_unemployed_rate), color = continent)) +
geom_bar(stat = "identity") +
labs(x = "Median Youth Unemployment Rate", y = "Subcontinent") +
theme_bw() +
scale_color_manual(values = cbbPalette)
The value displayed in the barplot in g. is the result of an
aggregation, so it might be useful to also plot error bars, to have a
general idea on how precise the median unemployment is. This can be
achieved by plotting the error bars which reflect the standard deviation
or the interquartile range of the variable in each of the
subcontinents.
Repeat the plot in h. but include also error bars which reflect the 25%
and 75% quantiles. You can use geom_errorbar in ggplot2.
# Task h: Median youth unemployment rate per subcontinent – with error bars
unemployment_stats <- df %>%
group_by(subcontinent, continent) %>%
summarize(median_youth_unemployment_rate = median(youth_unemployed_rate, na.rm = TRUE),
Q1 = quantile(youth_unemployed_rate, 0.25, na.rm = TRUE),
Q3 = quantile(youth_unemployed_rate, 0.75, na.rm = TRUE))
## `summarise()` has grouped output by 'subcontinent'. You can override using the
## `.groups` argument.
ggplot(unemployment_stats, aes(x = median_youth_unemployment_rate, y = fct_reorder(subcontinent, median_youth_unemployment_rate), fill = continent)) +
geom_bar(stat = "identity", alpha = 0.5) +
geom_errorbar(aes(xmin = Q1, xmax = Q3), width = 0.2) +
labs(x = "Median Youth Unemployment Rate", y = "Subcontinent") +
theme_bw() +
scale_fill_manual(values = cbbPalette)
Using ggplot2, create a plot showing the relationship between median
age and net migration rate.
• Color the geoms based on the income status.
• Add a regression line for each development status (using
geom_smooth()).
Comment on the plot. Do you see any relationship between the two
variables? Do you see any difference among the income levels?
For the regression lines it can be seen that the low income countries
have a positive slope, which means that the net migration rate increases
with increasing median age. This means that the younger the population
of a low income country the more people are leaving the country and the
older the population of a low income country the more the people stay or
migrate into this country.
Besides that, one can not really tell a difference between the income
levels since all regression lines are almost parallel to each other.
This means that the net migration rate stays the same with increasing
median age for the other three income levels.
For the data points we can obserce that the low income countries and
lower middle income countries have rather a median age between 15 and 30
years and most often a negative net migration rate between -10 and
0.
On the opposite, high income countries have a median age between 30 to
50 years and most often a positive net migration rate between 0 and
10.
# Relationship between median age and net migration rate
ggplot(df, aes(x = median_age, y = net_migration_rate, color = income_group)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Median Age", y = "Net Migration Rate") +
theme_minimal() +
scale_color_manual(values = cbbPalette)
## `geom_smooth()` using formula = 'y ~ x'
Create a plot as in Task i. but for youth unemployment and net migration rate. Comment briefly.
Here, we can observe that the low income countries have a positive
slope, which means that the net migration rate increases with increasing
youth unemployment rate, which does not make really sense. This means
that the higher the youth unemployment rate of a low income country the
more people are coming to the country. This must be due to the fact that
some countries have a very high outflux but very low youth unemployment
rate.
The other three income levels have a negative slope, which means that
the net migration rate decreases with increasing youth unemployment
rate. This means that the higher the youth unemployment rate of a
country the more people are leaving the country, which makes more
sense.
For the data points we can not observe a clear pattern between the youth
unemployment rate and the net migration rate. The data points are very
scattered and do not show a clear relationship between the two variables
besides what was already mentioned.
# Relationship between youth unemployment and net migration rate
ggplot(df, aes(x = youth_unemployed_rate, y = net_migration_rate, color = income_group)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Youth Unemployment Rate", y = "Net Migration Rate") +
theme_minimal() +
scale_color_manual(values = cbbPalette)
## `geom_smooth()` using formula = 'y ~ x'
Go online and find a data set which contains the 2020 population for the countries of the world together with ISO codes.
The data was obtained from the World Bank in the link https://data.worldbank.org/indicator/SP.POP.TOTL. The iso_code was also obtained from the link https://www.cia.gov/the-world-factbook/references/country-data-codes/.
# Load the population data
population_data <- read_csv("./API_SP.POP.TOTL_DS2_en_csv_v2_45183.csv", skip = 4)
population_data <- population_data %>%
select(`Country Code`, `2020`) %>%
rename(iso_code = `Country Code`, population = `2020`)
# merge
country_codes <- read_csv("./Country_Data_Codes.csv") %>%
select(Name, GENC) %>% rename(country = Name, iso_code = GENC)
# don't keep the right country
df <- df %>% left_join(country_codes, by = c("country"))
df <- df %>% left_join(population_data, by = c("iso_code"))
print("Are there missing values in the population column?")
print(any(is.na(df$population)))
ggplot(df, aes(x = population)) +
geom_histogram() +
labs(x = "Population") +
scale_x_continuous(limits = c(0, 5e+8)) +
labs(y = "Quantity of countries") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] "Are there missing values in the population column?"
## [1] FALSE
Make a scatterplot of median age and net migration rate for the countries of Europe. * Scale the size of the points according to each country’s population.
# Scatterplot of median age and net migration rate in Europe
df %>%
filter(continent == "europe") %>%
ggplot(aes(x = median_age, y = net_migration_rate)) +
geom_point(aes(size = population), alpha = 0.7) +
labs(x = "Median Age", y = "Net Migration Rate", title = "Net Migration Rate Vs. Median age") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "none")
The graph above shows a positive values of net migration rate for majority of the countries in Europe. The median age in most of the countries are between 40 and 45 year with the bigger population countries in this selection have a median age closer to 40 years.
On the merged data set from Task k., using function ggplotly from package plotly re-create the scatterplot in Task l., but this time for all countries. Color the points according to their continent.
When hovering over the points the name of the country, the values for
median age, net migration rate, and population should be shown. (Hint:
use the aesthetic text = Country. In ggplotly use the argument
tooltip = c("text", "x", "y", "size")).
p <- df %>%
ggplot(aes(x = median_age, y = net_migration_rate, text = country)) +
geom_point(aes(size = population, fill = continent), alpha = 0.7) +
labs(x = "Median Age", y = "Net Migration Rate", title = "Net Migration Rate Vs. Median age") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "none")
ggplotly(p, tooltip = c("country", "median_age", "net_migration_rate", "population"))
In parallel coordinate plots each observation or data point is depicted as a line traversing a series of parallel axes, corresponding to a specific variable or dimension. It is often used for identifying clusters in the data.
One can create such a plot using the GGally R package. You should create such a plot where you look at the three main variables in the data set: median age, youth unemployment rate and net migration rate. Color the lines based on the income status. Briefly comment
# Parallel coordinate plot
ggparcoord(df, columns = match(c("median_age", "youth_unemployed_rate", "net_migration_rate"), names(df)), groupColumn = "income_group", scale = "center", showPoints = TRUE, title = "Parallel Coordinate Plot - Centered")
# Parallel coordinate plot
ggparcoord(df, columns = match(c("median_age", "youth_unemployed_rate", "net_migration_rate"), names(df)), groupColumn = "income_group", scale = "globalminmax", showPoints = TRUE, title = "Parallel Coordinate Plot - Min Max")
In the parallel coordinate plot, we can see that the countries with higher income have a higher median age and the count of youth unemployment rate is lower and higher net migration rate. The countries with lower income have a lower median age, distributed youth unemployment rate and almost zero net migration rate, what can mean the population is unable to leave.
Using the package rworldmap, create a world map of the median age per country. Use the vignette https://cran.r-project.org/web/packages/rworldmap/vignettes/rworldmap.pdf to find how to do this in R.
sPDF <- joinCountryData2Map(df, joinCode = "ISO3", nameJoinColumn = "iso_code")
## 175 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 68 codes from the map weren't represented in your data
# Plot the map
mapParams <- mapCountryData(sPDF, nameColumnToPlot = "median_age", mapTitle = "Median Age per Country")
# Adjust the legend
do.call(addMapLegend, c(mapParams, legendWidth = 0.5))